Here is my mini-project on analysis of tHPI dataset. In the first try, late October 2015, The November and December prices were not determined yet, so I made predictions for them. Later, once the prices were anounced I compared my predictions with the final prices. Results were pretty okey, except the predictions of some cities such as Houston, Las Vegas and Denver for which prices had dropped further than expectations!

Data Processing

I start by importing a recent tHPI which shows the average price in USD for a standard double room, based on 25 of the most popular US cities. I also add the Latitude and Longitude of these 25 cities, which will be used for fancier visualizations. Here is the tHPI for top 6 cities of US.

##           city January February March April May June July August September
## 1     New York     243      245   297   377 410  372  317    320       430
## 2  Los Angeles     197      196   203   219 219  227  252    245       220
## 3      Chicago     154      163   213   259 333  311  299    267       293
## 4       Dallas     148      143   154   158 150  148  145    147       157
## 5 Philadelphia     172      189   207   228 246  244  200    198       247
## 6      Houston     149      157   157   161 173  146  149    143       150
##   October November December      lat        lon
## 1     430      378      365 40.71278  -74.00594
## 2     219      210      198 34.05223 -118.24368
## 3     307      259      189 41.83690  -87.68470
## 4     163      152      136 32.77670  -96.79700
## 5     241      217      191 39.95000  -75.16670
## 6     154      144      125 29.76040  -95.36980

I reshape the spreadsheet data into a four column data frame, with city, month, price, and date in first, second, third and forth columns, respectively. The first six rows of the new data frame are as following.

##           city   month price       date
## 1     New York January   243 0015-01-01
## 2  Los Angeles January   197 0015-01-01
## 3      Chicago January   154 0015-01-01
## 4       Dallas January   148 0015-01-01
## 5 Philadelphia January   172 0015-01-01
## 6      Houston January   149 0015-01-01
##            city   month price       date
## 245 New Orleans October   234 0015-10-01
## 246   St. Louis October   164 0015-10-01
## 247     Seattle October   228 0015-10-01
## 248 Minneapolis October   257 0015-10-01
## 249 San Antonio October   127 0015-10-01
## 250    San Jose October   209 0015-10-01

A few minor data processing tasks will be done later during the EDA and modeling phases. Next, I’ll conduct Exploratory Data Analysis (EDA) for getting more insights into the dataset.

Exploratory Data Analysis (EDA)

The following three plots, show the density of hotel prices.

According to the first plot, most of the hotel prices are around $150, however this price can go up to around $450. Based on the second plot, in January, most of the cities have the $150 price while in e.g., May the prices are much higher. The third plot, decomposes the price distribution of different cities. Which is hard to analyse as it is. Therefore, I continue with the following two boxplots.

Actually, the above two plots became my favorites. The former one, illustrates the high-season. The average price starts increasing in April and is maximum in July. Then drops in August.

The latter plot, immediately shows different type of cities. Boston and New York have very high prices, and their prices vary very much. While, Houston or Orlando are cheaper and much more stable.

The following heatmap illustrates how prices change across months and cities.

The lower right corner of the above heatmap, roughly, captures the cities and months which I see the highest prices. In this heatmap, I can also differentiate between cities with stable prices and cities with very varying prices.

Here I try to make categories for cities. First, I categorize each city as “Cheap Cities”, “Moderate Cities” and “Expensive Cities”. Second, I categorize them as “Stable Cities” and “Unstable Cities”. Before this categorization, let me extract the mean, standard deviation and range of prices from January to October, for each city. Following comes these statistics for the top 6 cities of US.

##           city price.mean  price.sd price.range price.mean.class
## 1     New York      344.1 70.345891         187        (284,359]
## 2  Los Angeles      219.7 18.541545          56        (210,284]
## 3      Chicago      259.9 62.939918         179        (210,284]
## 4       Dallas      151.3  6.429965          20        (135,210]
## 5 Philadelphia      217.2 27.336585          75        (210,284]
## 6      Houston      153.9  8.685237          30        (135,210]
##   price.sd.class price.range.class
## 1    (44.5,82.9]         (126,235]
## 2    (6.12,44.5]        (17.8,126]
## 3    (44.5,82.9]         (126,235]
## 4    (6.12,44.5]        (17.8,126]
## 5    (6.12,44.5]        (17.8,126]
## 6    (6.12,44.5]        (17.8,126]

The change of prices in each category are shown in following plots.

Fries: Cheap, Pizza: Moderate, Steak: Expensive Horse: Stable, Bull: Unstable

Using all the above plots, my intuition is that fluctuation of prices are following certain patterns. I illustrate some patterns, namely up-down-up, up-up-down-up, up-up-up-up, up-up-up-down in the following four subplots.

Based on my observations in above plots, and considering that the traveling patterns follow the climate changes, school holidays (and other yearly and 6-monthly periodic events), I expect that a linear combination of some Sinus and Cosinus functions with periods of 12 months and 6 months can estimate the trend of prices.

Modeling

It seems that every city is following a periodic trend. For this, I define these four predictors: \(sin(\frac{\pi t}{6}),cos(\frac{\pi t}{6}),sin(\frac{\pi t}{3}),cos(\frac{\pi t}{3})\) where \(t\) is the number of month ranging from 1 to 12 for January to December. Then, I use a linear model to predict the price of each month. See following two R functions.

# generating predictors for linear models
get.predictors <- function(t) {
  return(data.frame(t=t,
           sin = sin(pi/6*t), # 12 month period
           cos = cos(pi/6*t),
           sin2 = sin(pi/3*t), # 6 month period
           cos2 = cos(pi/3*t)
           )
         )
}

# predicting prices based on the existing prices
get.predictions <- function(price.trend,time.end) {
  lmfit <- lm(y ~ sin+cos+sin2+cos2,data=price.trend)
  prediction.time <- 1:time.end
  return(price.prediction <- data.frame(t=prediction.time,y=predict(lmfit,get.predictors(prediction.time))))
}

Estimating October Prices (Testing)

To test this model I train one model for each city, based on the prices from January to September, and then compare its predictions for October prices with the real values.

After visually testing the predicted prices of October, I realized that for some of the cities the predictions are very promissing, while for some other cities predictions fail.

Some good predictions of October price are shown in following plots (the black dots are real values and the red dots and lines are my predictions).

For some other cities, the predictions don’t work that well. This can be because of unknown predictors that our model doesn’t take into account, or other reasons!

Overall, the predictions are not very bad; the trend of prices is usually followed by the model. Therefore, next, I use my model to predict the prices of November and December.

Verification

I use all available prices from January to October to train my model and then I make predictions for November and December prices. The predictions for all 25 US cities are illustrated in following (The black dots and lines show the real prices and red lines show the predictions).

Following comes my predictions for prices in next months which can be compared with the real prices shown next to them.

##               city November November_aprox December December_aprox
## 1         New York      378            418      365            327
## 2      Los Angeles      210            201      198            195
## 3          Chicago      259            279      189            219
## 4           Dallas      152            161      136            153
## 5     Philadelphia      217            238      191            206
## 6          Houston      144            156      125            153
## 7  Washington D.C.      290            321      201            263
## 8            Miami      196            193      203            219
## 9          Atlanta      172            179      152            172
## 10          Boston      347            408      228            314
## 11   San Francisco      279            307      210            260
## 12         Detroit      159            169      149            167
## 13         Orlando      141            130      149            133
## 14       San Diego      173            149      141            144
## 15       Las Vegas      178            185      118            170
## 16         Phoenix      135            132      116            140
## 17    Indianapolis      127            127      122            117
## 18  Salt Lake City      129            141      114            134
## 19          Denver      178            197      148            186
## 20     New Orleans      200            219      179            209
## 21       St. Louis      151            153      118            137
## 22         Seattle      192            185      167            172
## 23     Minneapolis      205            232      160            198
## 24     San Antonio      117            107      110            108
## 25        San Jose      199            192      179            177

Clustering based on the Periodic Model

While training the models, for each city I get 4 model coefficients. I assume that these coefficients are the features for the cities. This allows me to use a K-means clustering algorithm to see if the coefficients can help me to make clusters for the cities.

A simple model with k=4 clusters have a goodness of 74.4%. The following figure illustrates the 25 cities on a map and each cluster is shown by a color.

The above map already captures some of the geographical and climate characteristics of cities. For instance, Seattle which indeed has a different climate from other cities is the single member of its own cluster. Besides, many cities in central and western parts of the US which have similar climates end up in the same cluster. These clusters should be studied further in the future.

Summary and Conclusion

In this technical report, I started by importing the tHPI data and applied some basic modifications and reshaping to data. Then I conducted EDA to get insights into this dataset. As a result of my EDA I realized that prices follow periodic patterns. Therefore, I decided to use \(Sinus\) and \(Cosinus\) functions to fit a model to data. First, I tested my model by leaving out the October prices from training data and then I visually compared my predictions with the real prices. Then, I trained my model based on all available prices and made predictions for November and December prices. Also, coefficients of my model helped me to summarize each city with four numeric features which I used to cluster the cities.

I emphasize that the analysis provided in this report is very basic, and the choices might be naive in some cases. To improve the results, more domain knowledge, more data and deeper analysis are required. Next section elaborates on some of the possible future improvements.

Future Perspective

If I find more time, I’ll do the following.

  1. Using other tHPI Datasets
  1. Collecting new Data
  1. Feedback loop - We should be careful with the effect of our predictions on future prices. The prices are determined in a negotiating process; by making predictions of future prices we might influence this negotiation process and this can result in emergence of new patterns.
  2. Pricing Anomalies - If I look at the monthly estimations and real world trends, I see some anomalies! For example in Houston the prices in May and June show a significant deviation from the expected periodic pattern. Also, in Atlanta prices of July and August deviate. July of San Diago and San Antonio should be explored more, too.
  3. The Periodic model I trained in this report was tested by leaving out the prices of October and then comparing estimations with real values for this month. However, I didn’t provide a concrete accuracy measure for this model. This can be further explored.